Constructing Positive Definite Elastic Kernels
with Application to Time Series Classification 

Pierre-François Marteau, Member, IEEE and Sylvie Gibet, Member, IEEE

Abstract — This paper proposes some extensions to the work on kernels dedicated to string alignment (biological sequence alignment) based on the summing up of scores obtained by local alignments with gaps. The extensions we propose allow us to construct, from classical time-warp distances, what we call summative time-warp kernels, which are positive definite if some simple sufficient conditions are satisfied. Furthermore, from the same formalism, we derive a time-warp inner product that extends the usual Euclidean inner product, providing the capability to handle discrete sequences or time series of variable lengths in a Hilbert space. The classification experiments we conducted, using either a first nearest neighbor classifier or a Support Vector Machine classifier, lead us to conclude that the positive definite elastic kernels we propose outperform the distance substituting kernels for the classical elastic distances we tested. Similarly, for the considered task, the kernel based on the distance induced by the time-warp inner product significantly outperforms the kernel based on the Euclidean distance.

Index Terms — Elastic distance, Time warp kernel, Time warp inner product, Definiteness, Time series classification, SVM.

1 Introduction

ELASTIC similarity measures such as Dynamic Time Warping (DTW) or Edit Distances have proved to be quite efficient compared to non-elastic similarity measures such as Euclidean measures or $L_p$ norms when addressing tasks that require the matching of time series data, in particular time series clustering and classification. A wide range of applications in physics, chemistry, finance, bioinformatics, network monitoring, etc., have demonstrated the benefits of using elastic measures. A natural follow-up to the elaboration of elastic measures is to examine whether or not it is possible to construct Reproducing Time Warp Hilbert Spaces (RTWHS) from a given elastic measure, basically vector spaces characterized by inner products having time-warp capabilities. Another intriguing question is whether it is possible to define an inner product structure from which a given elastic measure is induced. This question, apart from its theoretical implications, has a great impact when considering the potential application fields since, if the answer is positive, it provides direct access to the results and tools of linear algebra.

Unfortunately it seems that common elastic measures 
that are derived from DTW or Edit Distance are not 
directly induced by an inner product of any sort, even 
when such measures are metrics. One can conjecture that 
it is not possible to embed time series in a Hilbert
space having a time-warp capability using these classical 
elastic measures. 
This paper aims at exploring this issue and proposes Time Warp Kernel (TWK) constructions that try to preserve the properties of the elastic measures from which they are derived, while offering the possibility of embedding time series in Time Warped Hilbert Spaces. The main contributions of the paper are as follows:

1) we establish the indefiniteness of the main time- 
warp measures used in the literature, 

2) we propose some methods to construct positive 
definite kernels from classical time-warp measures, 

3) we define simple Time Warp Inner Product (TWIP) 
as an extension to the Euclidean Inner Product, and 

4) we experiment and compare the proposed kernels 
on time series classification tasks using a large va- 
riety of time series datasets. 

The paper is organized as follows: the second section synthesizes the related works; the third section introduces the notation and mathematical background used throughout the paper; the fourth section addresses the non-definiteness of classical elastic measures, which prevents the direct construction of an inner product from these measures. The fifth section develops the construction of some TWK and TWIP from classical elastic measures and discusses their potential benefits. The sixth section gathers clustering and classification experiments on a wide range of time series data and compares TWK and TWIP accuracies with classical elastic and non-elastic measures. The seventh section proposes a conclusion and further research perspectives. Appendix A gives the proof of our main results.

2 Related works 

During the last decades, the use of kernel-based methods in pattern analysis has provided numerous results and fruitful applications in various domains such as biology, statistics, networking, signal processing, etc. Some of these domains, such as bioinformatics, or more generally domains that rely on sequence or time series models, require the analysis and processing of variable length vectors, sequences or timestamped data. Various methods and algorithms have been developed to quantify the similarity of such objects. From the original dynamic programming [2] implementation of the symbolic edit distance [15] by Wagner and Fischer [31], the Smith and Waterman (SW) algorithm [27] has been designed to evaluate the similarity between two symbolic sequences by means of a local gap alignment. More efficient local heuristics have since been proposed to meet the massive symbolic data challenge, such as BLAST [1] or FASTA [19]. Similarly, dynamic time warping measures have been developed to evaluate similarity between numeric time series or timestamped data [30], [22], and more recently [6], [17] proposed elastic metrics dedicated to such numeric data.

The possibility of constructing kernels with elastic or time-warp properties from such elastic distances, allowing time series to be embedded in vector spaces (Euclidean or not), has attracted attention (e.g. [9], [12], [10]), since significant benefits are expected from potential applications of kernel-based machine learning algorithms to variable length data, or more generally data for which some elastic matching has a meaning. Among the kernel machine algorithms applicable to discrimination or regression tasks, Support Vector Machines (SVM) are reported to yield state-of-the-art performance.

SVM or large margin classifiers [28], [4], [24] are a set of supervised algorithms that learn how to solve discrimination or regression problems from positive and negative examples. They generalize linear classification algorithms by integrating two concepts: the maximal margin principle and a kernel function that defines the similarity or dissimilarity of any pair of examples, typically an inner product between the vector representations of two examples.

The definition of 'good' kernels from known elastic or time-warp distances applicable to data objects of variable lengths has been a major challenge since the 1990s. The notion of 'goodness' has rapidly been associated with the concept of definiteness. Basically, SVM algorithms involve an optimization process whose solution is proved to be uniquely defined if and only if the kernel is positive definite: in that case the objective function to optimize is quadratic and the optimization problem convex. Nevertheless, although the definiteness of kernels is an issue, in practice many situations exist where definite kernels are not applicable. This seems to be the case for the main elastic measures traditionally used to estimate the similarity of objects of variable lengths. A pragmatic approach consists of using indefinite kernels, although contradictory results have been reported about the impact of definiteness or indefiniteness of kernels on the empirical performance of SVMs. The sub-optimality of the non-convex optimization process is possibly one of the causes leading to these unguaranteed performances [32], [9]. Regularization procedures have been proposed to locally approximate indefinite kernel functions by definite ones, with some benefits. Among others, some approaches apply direct spectral transformations to indefinite kernels. These methods [33] consist in i) flipping the negative eigenvalues or shifting the eigenvalues using the minimal shift value required to make the spectrum of eigenvalues positive, and ii) reconstructing the kernel with the original eigenvectors in order to produce a positive semidefinite kernel. Yet, in general, 'convexification' procedures are difficult to interpret geometrically and the expected effect of the original indefinite kernel may be lost. Some theoretical insights have been provided through approaches that consist in embedding the data into a pseudo-Euclidean (pE) space and in formulating the classification problem with an indefinite kernel as that of minimizing the distance between the convex hulls formed from the two categories of data embedded in the pE space [10]. The geometric interpretation results in a constructive method allowing for the understanding, and in some cases the prediction, of the classification behavior of an indefinite kernel SVM in the corresponding pE space.
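The two spectral transformations mentioned above (flipping the negative eigenvalues, or shifting the whole spectrum by the minimal amount) can be sketched with NumPy as follows. This is only an illustration: the function name `spectral_fix` and the toy matrix are assumptions introduced here, not part of the cited methods.

```python
import numpy as np

def spectral_fix(K, mode="flip"):
    """Turn a symmetric, possibly indefinite matrix into a PSD one.

    mode="flip":  replace each eigenvalue by its absolute value.
    mode="shift": add the smallest constant that makes every eigenvalue >= 0.
    The original eigenvectors are reused for the reconstruction.
    """
    K = 0.5 * (K + K.T)                  # enforce exact symmetry
    w, V = np.linalg.eigh(K)             # eigenvalues (ascending), eigenvectors
    if mode == "flip":
        w_new = np.abs(w)
    elif mode == "shift":
        w_new = w - min(w.min(), 0.0)
    else:
        raise ValueError("mode must be 'flip' or 'shift'")
    return (V * w_new) @ V.T             # V diag(w_new) V^T

# Toy indefinite "kernel" matrix with eigenvalues 3 and -1 (hypothetical data).
K = np.array([[1.0, 2.0], [2.0, 1.0]])
K_flip = spectral_fix(K, "flip")
K_shift = spectral_fix(K, "shift")
```

On this toy matrix, flipping yields eigenvalues 3 and 1, while shifting yields 4 and 0, so the two procedures generally produce different kernels from the same indefinite input.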



Some work like ||26|, (TH addresses the construction 
of elastic kernels for time series analysis through a 
decomposition of time series as a sum of local low-degree polynomials; using a resampling process, the piecewise approximations of the time series are embedded into a so-called Reproducing Kernel Hilbert Space in which the SVM is learned.

Our approach is more direct, as it tries to use the elastic distance directly in the kernel construction, without any approximation or resampling process. It is founded on the work of Haussler on convolution kernels [11] defined on a set of discrete structures such as strings, trees, or graphs. The iterative method that is developed is generative, as it allows for the building of complex kernels from the convolution of simple local kernels. Following the work of Haussler [11], Saigo et al. [21] define, from the Smith and Waterman algorithm [27], a kernel to detect local alignment between strings by convolving simpler kernels. These authors show that the Smith and Waterman distance measure, dedicated to determining similar regions between two nucleotide or protein sequences, is not definite, but is nevertheless connected to the logarithm of a point-wise limit of a series of definite convolution kernels. In fact, these previous studies have very general implications, the first being that classical elastic measures can also be understood as the limit of a series of definite convolution kernels. We generalize to some extent the results presented by Saigo et al. on the Smith and Waterman algorithm and propose extensions to construct time-warp inner products.
3 Notations and mathematical backgrounds

To ensure that this paper is relatively self-contained, we 
give in this section commonly used definitions, with few 
details, for metric, kernel and definiteness, sequence set, 
and classical elastic measures. 

3.1 Metric 

Definition 3.1: A metric, also called a distance, on a set $U$ is a function $\delta : U \times U \to \mathbb{R}$ which satisfies the following axioms:
For all $x, y, z \in U$,

1) $\delta(x,y) \ge 0$ (non negativity)

2) $\delta(x,y) = 0$ if and only if $x = y$ (null iff identical)

3) $\delta(x,y) = \delta(y,x)$ (symmetry)

4) $\delta(x,z) \le \delta(x,y) + \delta(y,z)$ (subadditivity/triangle inequality)



3.2 Kernel and definiteness 

A very large literature exists on kernels, among which [3], [24] and [25] present a large synthesis of major results. We give hereinafter some basic definitions.

Definition 3.2: A kernel on a non empty set $U$ refers to a complex (or real) valued symmetric function $\varphi(x,y) : U \times U \to \mathbb{C}$ (or $\mathbb{R}$).

Definition 3.3: Let $U$ be a non empty set. A function $\varphi : U \times U \to \mathbb{C}$ is called a positive (resp. negative) definite kernel if and only if it is Hermitian (i.e. $\varphi(x,y) = \overline{\varphi(y,x)}$, where the overline stands for the conjugate number) for all $x$ and $y$ in $U$, and $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \ge 0$ (resp. $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \le 0$), for all $n$ in $\mathbb{N}$, $(x_1, x_2, \ldots, x_n) \in U^n$ and $(c_1, c_2, \ldots, c_n) \in \mathbb{C}^n$.

Definition 3.4: Let $U$ be a non empty set. A function $\varphi : U \times U \to \mathbb{C}$ is called a conditionally positive (resp. conditionally negative) definite kernel if and only if it is Hermitian (i.e. $\varphi(x,y) = \overline{\varphi(y,x)}$ for all $x$ and $y$ in $U$) and $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \ge 0$ (resp. $\sum_{i,j=1}^{n} c_i \bar{c}_j \varphi(x_i, x_j) \le 0$), for all $n \ge 2$ in $\mathbb{N}$, $(x_1, x_2, \ldots, x_n) \in U^n$ and $(c_1, c_2, \ldots, c_n) \in \mathbb{C}^n$ with $\sum_{i=1}^{n} c_i = 0$.

In the last two definitions above, it is easy to show that it is sufficient to consider mutually different elements in $U$, i.e. collections of distinct elements $x_1, x_2, \ldots, x_n$. This is what we will consider for the remainder of the paper.

Definition 3.5: A positive (resp. negative) definite ker- 
nel defined on a finite set U is also called a positive (resp. 
negative) semidefinite matrix. Similarly, a positive (resp. 
negative) conditionally definite kernel defined on a finite 
set is also called a positive (resp. negative) conditionally 
semidefinite matrix. 

For convenience, we will use PD and CPD for positive definite and conditionally positive definite to characterize either a kernel or a matrix having these properties.

Constructing PD kernels from CPD kernels is quite straightforward. For instance, if $-\varphi$ is a CPD kernel on a set $U$ and $x_0 \in U$, then [3] $\psi(x,y) = \varphi(x,x_0) + \varphi(y,x_0) - \varphi(x,y) - \varphi(x_0,x_0)$ is a PD kernel, and so are $e^{\psi(x,y)}$ and $e^{-\varphi(x,y)}$. The converse is also true.

Furthermore, $e^{-t \cdot \varphi(x,y)}$ is PD for $t > 0$ if $-\varphi$ is CPD. We will use precisely this last result to construct PD kernels from classical elastic distances.
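As a quick numerical illustration of this construction, the sketch below takes the squared Euclidean distance on a few real numbers as $\varphi$ (a standard example for which $-\varphi$ is CPD) and inspects the smallest eigenvalue of the resulting Gram matrices. The sample points, the choice $x_0 = x_1$ and the value $t = 0.8$ are arbitrary assumptions made for the example.

```python
import numpy as np

# Hypothetical sample points; phi is the squared Euclidean distance, so -phi is CPD.
x = np.array([0.0, 0.5, 1.3, 2.0, 3.7])
phi = (x[:, None] - x[None, :]) ** 2

# psi(x, y) = phi(x, x0) + phi(y, x0) - phi(x, y) - phi(x0, x0), with x0 = x[0].
psi = phi[:, [0]] + phi[[0], :] - phi - phi[0, 0]

# exp(-t * phi) with t > 0 (here this is simply a Gaussian kernel).
gauss = np.exp(-0.8 * phi)

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

print(min_eig(-phi), min_eig(psi), min_eig(gauss))
```

The matrix $-\varphi$ itself has a negative eigenvalue (its trace is zero), whereas both derived matrices are positive semidefinite up to rounding error.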

3.3 Sequence set 

Definition 3.6: Let $U$ be the set of finite sequences (symbolic sequences or time series): $U = \{A_1^p \mid p \in \mathbb{N}\}$. $A_1^p$ is a sequence with discrete index varying between 1 and $p$. We denote by $\Omega$ the empty sequence (with null length), and by convention $A_1^0 = \Omega$, so that $\Omega$ is a member of set $U$. $|A|$ denotes the length of the sequence $A$. Let $U_p = \{A \in U \mid |A| \le p\}$ be the set of sequences whose length is shorter than or equal to $p$.

Definition 3.7: Let $A$ be a finite sequence. Let $A(i)$ be the $i$-th element (symbol or sample) of sequence $A$. We will consider that $A(i) \in S \times T$, where $S$ embeds the multidimensional space variables (either symbolic or numeric) and $T \subset \mathbb{R}$ embeds the time stamp variable, so that we can write $A(i) = (a(i), t_a(i))$ where $a(i) \in S$ and $t_a(i) \in T$, with the condition that $t_a(i) > t_a(j)$ whenever $i > j$ (time stamps strictly increase in the sequence of samples). $A_i^j$ with $i \le j$ is the subsequence consisting of the $i$-th through the $j$-th element (inclusive) of $A$, so $A_i^j = A(i)A(i+1) \ldots A(j)$. $\Lambda$ denotes the null element. $A_i^j$ with $i > j$ is the null time series, i.e. $\Omega$.



3.4 General Edit/Elastic distance on a sequence set 

Definition 3.8: An edit operation is a pair $(a, b) \neq (\Lambda, \Lambda)$ of sequence elements, written $a \to b$. Sequence $B$ results from the application of the edit operation $a \to b$ to sequence $A$, written $A \Rightarrow B$ via $a \to b$, if $A = \sigma a \tau$ and $B = \sigma b \tau$ for some subsequences $\sigma$ and $\tau$. We call $a \to b$ a match operation if $a \neq \Lambda$ and $b \neq \Lambda$, a delete operation if $b = \Lambda$, and an insert operation if $a = \Lambda$.

For any pair of sequences $A_1^p$, $B_1^q$, we consider the extensions $A_0^p$, $B_0^q$ whose first element is the null symbol $\Lambda$. Each elementary edit operation related to position $0 \le i \le p$ in sequence $A$ and to position $0 \le j \le q$ in sequence $B$ is associated with a cost value $\Gamma_{A(i) \to B(j)}(A_1^p, B_1^q)$, $\Gamma_{A(i) \to \Lambda}(A_1^p, B_1^q)$ or $\Gamma_{\Lambda \to B(j)}(A_1^p, B_1^q) \in \mathbb{R}$. To simplify the notation we will simply write $\Gamma(A(i) \to B(j))$, $\Gamma(A(i) \to \Lambda)$ or $\Gamma(\Lambda \to B(j))$, although this will not be fully appropriate in general.
Definition 3.9: A function $\delta : U \times U \to \mathbb{R}$ is called an edit distance defined on $U$ if, for any pair of sequences $A_1^p$, $B_1^q$, the following recursive equation is satisfied:

$$\delta(A_1^p, B_1^q) = \min \begin{cases} \delta(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Lambda) & \text{delete} \\ \delta(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{match} \\ \delta(A_1^p, B_1^{q-1}) + \Gamma(\Lambda \to B(q)) & \text{insert} \end{cases}$$

Note that not all edit/elastic distances are metric. In 
particular, the dynamic time warping distance does not 
satisfy the triangle inequality. 

3.4.1 Levenshtein distance

The Levenshtein distance $\delta_{lev}(x,y)$ has been defined for string matching. For this edit distance, the delete and insert operations induce unitary costs, i.e. $\Gamma(A(p) \to \Lambda) = \Gamma(\Lambda \to B(q)) = 1$, while the match cost is null if $A(p) = B(q)$ or 1 otherwise.
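The recursion of Definition 3.9 with these unit costs can be sketched by standard dynamic programming as follows. The function name `levenshtein` and the border initialization (deleting or inserting every element of a prefix) are the usual choices and are stated here as assumptions.

```python
def levenshtein(a, b):
    """Levenshtein distance between two sequences by dynamic programming."""
    p, q = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        d[i][0] = i                      # i deletions
    for j in range(1, q + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            match_cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                  # delete a[i-1]
                d[i - 1][j - 1] + match_cost,     # match a[i-1] with b[j-1]
                d[i][j - 1] + 1,                  # insert b[j-1]
            )
    return d[p][q]

print(levenshtein("kitten", "sitting"))
```

The table has $(p+1)(q+1)$ cells and each cell is filled in constant time, which is where the quadratic cost of edit distances comes from.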



3.4.2 Dynamic time warping 

The DTW similarity measure Sdtw IH 
is defined according to the previous notations as:

$$\delta_{dtw}(A_1^p, B_1^q) = d_{L_p}(a(p), b(q)) + \min \begin{cases} \delta_{dtw}(A_1^{p-1}, B_1^q) \\ \delta_{dtw}(A_1^{p-1}, B_1^{q-1}) \\ \delta_{dtw}(A_1^p, B_1^{q-1}) \end{cases}$$

where $d_{L_p}(a(p), b(q))$ is the $L_p$ norm in $\mathbb{R}^k$, and so for DTW, $\Gamma(A(p) \to \Lambda) = \Gamma(A(p) \to B(q)) = \Gamma(\Lambda \to B(q)) = d_{L_p}(a(p), b(q))$. Let us note that the time stamp values are not used; therefore the costs of each edit operation involve vectors $a$ and $b$ in $S$ instead of vectors $(a, t_a)$ and $(b, t_b)$ in $S \times T$. One of the main restrictions of $\delta_{dtw}$ is that it does not comply with the triangle inequality as
shown in |i6J. 
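A minimal sketch of the DTW recursion for scalar time series is given below, using the absolute difference as local cost. The border convention (infinite cost outside the grid, zero at the origin) is the usual one and is an assumption here, as is the function name `dtw`. The three short series at the end form a small example in which the triangle inequality fails.

```python
import math

def dtw(a, b):
    """DTW between two scalar series, with |a(i) - b(j)| as local cost."""
    p, q = len(a), len(b)
    D = [[math.inf] * (q + 1) for _ in range(p + 1)]
    D[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[p][q]

# Triangle inequality check on three tiny series (hypothetical data).
x, y, z = [0], [1], [1, 1]
print(dtw(x, z), dtw(x, y) + dtw(y, z))   # 2.0 versus 1.0
```

Here the single sample of `x` is matched twice against `z`, so going directly from `x` to `z` costs more than going through `y`.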

3.4.3 Edit Distance with real penalty

$$\delta_{erp}(A_1^p, B_1^q) = \min \begin{cases} \delta_{erp}(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Lambda) & \text{delete} \\ \delta_{erp}(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{match} \\ \delta_{erp}(A_1^p, B_1^{q-1}) + \Gamma(\Lambda \to B(q)) & \text{insert} \end{cases}$$

with

$$\Gamma(A(p) \to \Lambda) = d_{L_p}(a(p), g)$$
$$\Gamma(A(p) \to B(q)) = d_{L_p}(a(p), b(q))$$
$$\Gamma(\Lambda \to B(q)) = d_{L_p}(g, b(q))$$

where $g$ is a constant in $S$ and $d_{L_p}(x, y)$ is the $L_p$ norm of vector $(x - y)$ in $S$.

Note that the time stamp coordinate is not taken into account; therefore $\delta_{erp}$ is a distance on $S$ but not on $S \times T$. Thus, the cost of each edit operation involves vectors $a$ and $b$ in $\mathbb{R}^k$ instead of vectors $(a, t_a)$ and $(b, t_b)$ in $\mathbb{R}^{k+1}$.

According to the authors of ERP [6], the constant $g$ should be set to 0 for some intuitive geometric interpretation and in order to preserve the mean value of the transformed time series when adding gap samples.
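The ERP recursion can be sketched for scalar series as follows, with the absolute difference as local norm and a gap constant `g` defaulting to 0. The function name `erp` and the border initialization (accumulated gap costs along the first row and column) are assumptions made for this illustration.

```python
def erp(a, b, g=0.0):
    """Edit distance with real penalty between two scalar series."""
    p, q = len(a), len(b)
    D = [[0.0] * (q + 1) for _ in range(p + 1)]
    for i in range(1, p + 1):
        D[i][0] = D[i - 1][0] + abs(a[i - 1] - g)    # delete every a(i)
    for j in range(1, q + 1):
        D[0][j] = D[0][j - 1] + abs(b[j - 1] - g)    # insert every b(j)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            D[i][j] = min(
                D[i - 1][j] + abs(a[i - 1] - g),             # delete
                D[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # match
                D[i][j - 1] + abs(b[j - 1] - g),             # insert
            )
    return D[p][q]

print(erp([1, 2, 3], [1, 3]))   # the cheapest edit deletes the sample 2
```

Unlike the DTW sketch, a gap is paid for here by comparing the unmatched sample with the fixed reference value `g` rather than by repeating a neighboring sample.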



3.4.4 Time warp edit distance 

Time Warp Edit Distance (TWED) fH], [17] is defined 
similarly to the edit distance defined for strings [15], [31]. The similarity between any two time series $A$ and $B$ of finite length, respectively $p$ and $q$, is defined as:

$$\delta_{twed}(A_1^p, B_1^q) = \min \begin{cases} \delta_{twed}(A_1^{p-1}, B_1^q) + \Gamma(A(p) \to \Lambda) & \text{delete}_A \\ \delta_{twed}(A_1^{p-1}, B_1^{q-1}) + \Gamma(A(p) \to B(q)) & \text{match} \\ \delta_{twed}(A_1^p, B_1^{q-1}) + \Gamma(\Lambda \to B(q)) & \text{delete}_B \end{cases}$$

with

$$\Gamma(A(p) \to \Lambda) = d(A(p), A(p-1)) + \lambda$$
$$\Gamma(A(p) \to B(q)) = d(A(p), B(q)) + d(A(p-1), B(q-1))$$
$$\Gamma(\Lambda \to B(q)) = d(B(q), B(q-1)) + \lambda$$

The time stamps are exploited to evaluate $d(A(p), B(q))$. In practice, $d(A(p), B(q)) = d_{L_p}(a(p), b(q)) + \nu \cdot d_{L_p}(t_a(p), t_b(q))$, where $\lambda$ is a positive constant that represents a gap penalty and $\nu$ is a non negative constant which characterizes the stiffness of the $\delta_{twed}$ elastic measure.
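The following sketch implements the TWED recursion for scalar, time-stamped series. Two points are assumptions rather than statements from the text: each series is padded at index 0 with a null sample of value 0 and time stamp 0, and the borders of the table are set to infinity except at the origin. The function name `twed` and the default parameter values are also illustrative.

```python
import math

def twed(a, ta, b, tb, nu=0.001, lam=1.0):
    """Time Warp Edit Distance between scalar series a, b with time stamps ta, tb."""
    p, q = len(a), len(b)
    # Assumed padding: a null sample (value 0, time 0) at index 0.
    a, ta = [0.0] + list(a), [0.0] + list(ta)
    b, tb = [0.0] + list(b), [0.0] + list(tb)
    D = [[math.inf] * (q + 1) for _ in range(p + 1)]
    D[0][0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            delete_a = (D[i - 1][j] + abs(a[i] - a[i - 1])
                        + nu * abs(ta[i] - ta[i - 1]) + lam)
            match = (D[i - 1][j - 1]
                     + abs(a[i] - b[j]) + nu * abs(ta[i] - tb[j])
                     + abs(a[i - 1] - b[j - 1]) + nu * abs(ta[i - 1] - tb[j - 1]))
            delete_b = (D[i][j - 1] + abs(b[j] - b[j - 1])
                        + nu * abs(tb[j] - tb[j - 1]) + lam)
            D[i][j] = min(delete_a, match, delete_b)
    return D[p][q]

print(twed([1, 1], [1, 2], [1], [1], nu=0.0, lam=1.0))
```

With `nu = 0` the time stamps are ignored and only the gap penalty `lam` separates the two series of the example, which differ by one repeated sample.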

3.5 Indefiniteness of elastic distance kernels

In Appendix A we give counter examples, one for each elastic distance we have previously defined, showing that these distances do not lead to definite kernels.

This demonstrates that the metric properties of a dis- 
tance defined on U, in particular the triangle inequality, 
are not sufficient conditions to establish definiteness 
(conditionally or not) of the associated distance kernel. 
One could conjecture that elastic distances cannot be 
definite (conditionally or not), possibly because of the 
presence of the max or min operators in the recursive 
equation. In the following sections, we will see that 
replacing these min or max operators by a sum operator 
allows, under some conditions, for the construction of 
series of positive definite kernels whose limit is quite 
directly connected to the previously addressed elastic 
distance kernels. 

4 Constructing positive definite kernels from elastic distance

The main idea leading to the construction of positive definite kernels from a given elastic distance defined on $U$ is to replace the min or max operator in the recursive equation defining the elastic distance by a $\sum$ operator. Instead of keeping one of the best alignment paths, the new kernel will sum up all the subsequence alignments with some weighting factor that could be optimized. This has been done successfully for the Smith and Waterman symbolic distance, which is also known to be indefinite [21], and more recently for dynamic time warping kernels for time series alignment [8]. In the following subsections, we propose some generalizations and extensions of these results, which we evaluate in some time series classification experiments.
4.1 Summative Time Warped Kernels

Definition 4.1: A function $\langle \cdot, \cdot \rangle : U \times U \to \mathbb{R}$ is called a Summative Time Warp Kernel (STWK) if, for any pair of sequences $A_1^p$, $B_1^q$, there exists a function $f : \mathbb{R} \to \mathbb{R}$ such that the following recursive equation is satisfied:

$$\langle A_1^p, B_1^q \rangle = \sum \begin{cases} \langle A_1^{p-1}, B_1^q \rangle \star f(\Gamma(A(p) \to \Lambda)) & \text{delete} \\ \langle A_1^{p-1}, B_1^{q-1} \rangle \star f(\Gamma(A(p) \to B(q))) & \text{match} \\ \langle A_1^p, B_1^{q-1} \rangle \star f(\Gamma(\Lambda \to B(q))) & \text{insert} \end{cases}$$

where $\star$ is either the addition or the multiplication. This recursive definition requires an initialization. To that end we set $\langle \Omega, \Omega \rangle = \xi$, where $\xi$ is a real constant and $\Omega$ the null sequence.

4.1.1 Interpretation of STWK 

To interpret STWK we need first to introduce the concept 
of alignment path between two sequences or time series. 

Definition 4.2: An $(N, M)$-warping path is a sequence $p = (p_1, \cdots, p_L)$ with $p_l = (n_l, m_l) \in \{1, \cdots, N\} \times \{1, \cdots, M\}$ for $l \in \{1, \cdots, L\}$ satisfying the following three conditions:

i) Boundary condition: $p_1 = (1, 1)$ and $p_L = (N, M)$.

ii) Monotonicity condition: $n_1 \le n_2 \le \cdots \le n_L$ and $m_1 \le m_2 \le \cdots \le m_L$.

iii) Step size condition: $m_{l+1} > m_l$ or (inclusive) $n_{l+1} > n_l$ for $l \in \{1, \cdots, L-1\}$.

Summative refers to the $\sum$ operator replacing the min or max usually used. The recursion is initialized using $\langle x, \Omega \rangle = \langle \Omega, \Omega \rangle = \xi \in \mathbb{R}$.

This type of kernel sums up the multiplication or addition of the local quantities $f(\Gamma(A(i) \to B(j)))$ for all the possible alignment paths between the two time series.
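To give a feel for the set of alignment paths over which such a kernel sums, the sketch below enumerates warping paths and counts them. It assumes unit steps only, i.e. each step increases $n$, $m$, or both by exactly one, which is the step set used by the recursive equations; the function names are hypothetical.

```python
def warping_paths(N, M):
    """Yield every (N, M)-warping path made of unit steps, as a list of (n, m) pairs."""
    def extend(path):
        n, m = path[-1]
        if (n, m) == (N, M):
            yield path
            return
        for dn, dm in ((1, 0), (1, 1), (0, 1)):
            if n + dn <= N and m + dm <= M:
                yield from extend(path + [(n + dn, m + dm)])
    yield from extend([(1, 1)])

def count_paths(N, M):
    """Count the same paths by dynamic programming, without enumerating them."""
    c = [[0] * (M + 1) for _ in range(N + 1)]
    c[1][1] = 1
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            if (n, m) != (1, 1):
                c[n][m] = c[n - 1][m] + c[n - 1][m - 1] + c[n][m - 1]
    return c[N][M]

print(len(list(warping_paths(3, 3))), count_paths(3, 3))
```

The counting recursion has the same shape as a multiplicative STWK whose local terms are all equal to one, which is why a summative kernel can account for a rapidly growing number of paths at the cost of filling a single table.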

Definition 4.3: If $\star$ is the addition, the STWK is called additive; otherwise it is called multiplicative.



4.1.2 Definiteness of STWK 

The following theorem, which generalizes the one presented in [8], establishes sufficient conditions on $f(\Gamma(a \to b))$ for an STWK to be definite, and thus provides a basis for the construction of definite STWK.

Theorem 4.4: Definiteness of STWK:

i) If the local kernel $k(x,y) = f(\Gamma(x \to y))$ is positive definite on $((S \times T) \cup \{\Lambda\})^2$ and if $\xi \ge 0$ ($\xi > 0$ for multiplicative STWK), then the resulting STWK is positive definite on $U \times U$.

ii) An additive STWK is negative definite on $U$ if the local kernel $k(x,y) = f(\Gamma(x \to y))$ is negative definite on $((S \times T) \cup \{\Lambda\})^2$ and $\xi \le 0$.

iii) An additive STWK is conditionally positive definite if the local kernel $k(x,y) = f(\Gamma(x \to y))$ is conditionally positive definite on $((S \times T) \cup \{\Lambda\})^2$ and if $\xi \ge 0$.

A sketch of proof for Theorem 4.4 is given in Appendix B.

As the cost function $\Gamma$ is, in general, conditionally negative definite, choosing the exponential for $f(h)$ ensures that $f(\Gamma(x \to y))$ is a positive definite kernel [23]. Other functions can be used, such as the Inverse Multi Quadric kernel, whose usual form is $k(x,y) = 1/\sqrt{\|x - y\|^2 + c^2}$ with $c > 0$. As with the exponential (Gaussian or Laplace) kernel, the Inverse Multi Quadric kernel results in a positive definite matrix with full rank [18] and thus induces an infinite-dimensional feature space.



4.1.3 Computational cost of STWK 

Although the number of paths that are summed up increases exponentially with the lengths $|A|$ and $|B|$ of the time series that are evaluated, the recursive computation of STWK$(A, B)$ leads to a quadratic computational cost $O(|A| \cdot |B|)$, e.g. $O(N^2)$ if $N$ is the average length of the time series that are considered. This quadratic complexity can be reduced to a linear complexity by limiting the number of alignment paths to consider in the recursion. This can be achieved by using a search corridor [22], as long as the kernel remains symmetric, which is the case when processing time series of equal lengths and restraining the search space using, for instance, a fixed size corridor symmetrically positioned around the diagonal as shown in Fig. 1.
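A generic dynamic-programming evaluation of an STWK, with an optional corridor, can be sketched as follows. Several choices here are assumptions made for the illustration: the function name `stwk`, the use of `None` for the null element, the callable `local(x, y)` standing for $f(\Gamma(x \to y))$, and the handling of the borders, where only the delete (first column) or insert (first row) term of the recursion is applied. The function also returns the number of table cells that were filled, to make the computational cost visible.

```python
def stwk(A, B, local, xi=1.0, multiplicative=True, corridor=None):
    """Evaluate a summative time-warp kernel and count the filled table cells.

    local(x, y) plays the role of f(Gamma(x -> y)); None denotes the null element.
    corridor is the half-width of the band |i - j| <= corridor (None: full table).
    """
    p, q = len(A), len(B)
    if multiplicative:
        op = lambda k, w: k * w
    else:
        op = lambda k, w: k + w
    K = [[None] * (q + 1) for _ in range(p + 1)]   # None marks cells outside the band
    K[0][0] = xi
    cells = 0
    for i in range(p + 1):
        for j in range(q + 1):
            if i == 0 and j == 0:
                continue
            if corridor is not None and abs(i - j) > corridor:
                continue
            terms = []
            if i > 0 and K[i - 1][j] is not None:            # delete A(i)
                terms.append(op(K[i - 1][j], local(A[i - 1], None)))
            if i > 0 and j > 0 and K[i - 1][j - 1] is not None:   # match
                terms.append(op(K[i - 1][j - 1], local(A[i - 1], B[j - 1])))
            if j > 0 and K[i][j - 1] is not None:            # insert B(j)
                terms.append(op(K[i][j - 1], local(None, B[j - 1])))
            K[i][j] = sum(terms)
            cells += 1
    return K[p][q], cells

ones = lambda x, y: 1.0          # with unit local terms the kernel counts the paths
print(stwk("abc", "abc", ones))               # full table
print(stwk("abc", "abc", ones, corridor=1))   # band of half-width 1
```

For a fixed corridor width the number of filled cells grows linearly with the length of the series, at the price of discarding the alignment paths that leave the band.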

4.2 Some instances of additive and multiplicative 
STWK 

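4.2.1 Multiplicative exponentiated STWK

Definition 4.5:

$$\langle A_1^p, B_1^q \rangle_{me} = \sum \begin{cases} \langle A_1^{p-1}, B_1^q \rangle_{me} \cdot e^{-\nu' \cdot \Gamma(A(p) \to \Lambda)} \\ \langle A_1^{p-1}, B_1^{q-1} \rangle_{me} \cdot e^{-\nu' \cdot \Gamma(A(p) \to B(q))} \\ \langle A_1^p, B_1^{q-1} \rangle_{me} \cdot e^{-\nu' \cdot \Gamma(\Lambda \to B(q))} \end{cases} \qquad (2)$$

where $\nu'$ is a stiffness parameter that weighs the contribution of the local elementary costs. The larger $\nu'$ is, the more selective the kernel is around the optimal paths. At the limit, when $\nu' \to \infty$, only the optimal path costs are summed up by the kernel. Note that since, in general, several optimal paths leading to the same global cost exist, $\lim_{\nu' \to +\infty} -1/\nu' \cdot \log(\langle A, B \rangle_{me})$ does not coincide with the elastic distance $\delta$ that involves the same corresponding elementary costs.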
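As an illustration, the sketch below evaluates the multiplicative exponentiated STWK with the Levenshtein costs (unit cost for delete and insert, cost 1 for a mismatch and 0 for an exact match) and $\xi = 1$. The function names `stwk_me` and `soft_distance`, the border handling (only the delete or insert term on the borders) and the test strings are assumptions introduced for the example.

```python
import math

def stwk_me(A, B, nu, gap=1.0):
    """<A, B>_me with Levenshtein costs and xi = 1."""
    p, q = len(A), len(B)
    K = [[0.0] * (q + 1) for _ in range(p + 1)]
    K[0][0] = 1.0
    w_gap = math.exp(-nu * gap)              # weight of a delete or an insert
    for i in range(p + 1):
        for j in range(q + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:
                total += K[i - 1][j] * w_gap
            if i > 0 and j > 0:
                mismatch = 0.0 if A[i - 1] == B[j - 1] else 1.0
                total += K[i - 1][j - 1] * math.exp(-nu * mismatch)
            if j > 0:
                total += K[i][j - 1] * w_gap
            K[i][j] = total
    return K[p][q]

def soft_distance(A, B, nu):
    """-1/nu * log(<A, B>_me), to be compared with the edit distance."""
    return -math.log(stwk_me(A, B, nu)) / nu

for nu in (1.0, 5.0, 10.0):
    print(nu, soft_distance("kitten", "sitting", nu))
```

Since every path contributes a positive term and the best path contributes $e^{-\nu' \delta}$, the quantity $-1/\nu' \cdot \log\langle A, B \rangle_{me}$ never exceeds the corresponding edit distance (3 for this pair of strings) and it increases with $\nu'$. For long sequences the kernel value itself becomes very small, so a practical implementation would probably work in the log domain.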
Fig. 1. Example of a symmetric corridor used to reduce the number of admissible alignment paths.

We suggest setting $\xi = 1$ for this kind of multiplicative STWK.

Proposition 4.6: Definiteness of the multiplicative exponentiated STWK $\langle \cdot, \cdot \rangle_{me}$:
$\langle \cdot, \cdot \rangle_{me}$ is positive definite for the cost functions $\Gamma(A(p) \to \Lambda)$, $\Gamma(A(p) \to B(q))$ and $\Gamma(\Lambda \to B(q))$ involved in the computation of the $\delta_{lev}$, $\delta_{dtw}$, $\delta_{erp}$ and $\delta_{twed}$ distances.

The multiplicative STWKs constructed from these distances are referred to respectively as $STWK_{lev}$, $STWK_{dtw}$, $STWK_{erp}$ and $STWK_{twed}$ in the rest of the paper.

The proof of Proposition 4.6 is straightforward and is omitted.



4.2.2 Interpretation of multiplicative STWK
For multiplicative STWK each alignment path is assigned a cost that is the product of the local costs attached to each edge of the path. For multiplicative exponentiated STWK, the local cost, e.g. $e^{-\nu' \cdot \Gamma(A(p) \to B(q))}$, can be interpreted, up to a normalizing constant, as a probability to align symbol $A(p)$ with symbol $B(q)$, and the cost assigned to each path can be interpreted as the probability of a specific alignment between two sequences. In that case the STWK, which sums up the probabilities of all possible alignment paths between two sequences, can be interpreted as a matching probability between two sequences. This probabilistic interpretation suggests an analogy between multiplicative STWK and the alpha-beta algorithm designed to learn HMM models: while the Viterbi algorithm, which uses a max operator in a dynamic programming implementation (just like the DTW algorithm), evaluates only the probability of the best alignment path, the alpha-beta algorithm is based on the summation of the probabilities of all possible alignment paths. As reported in [21], the main drawback of this kind of kernel is the vanishing of the product of local costs (which are lower than one) when comparing long sequences. When considering the Gram matrix (pairwise kernel evaluations on a finite set), this leads to a matrix that suffers from the diagonal dominance problem, i.e. the fact that the kernel value decreases extremely fast when the similarity slightly decreases.



4.2.3 Additive STWK 

Although a very large family of distinct additive STWK 
exists, we present below two simple instances of ad- 
ditive STWK that correspond to generalizations of the 
Euclidean inner product. 
Definition 4.7: 



^ /IP R9 ^^ . _ 1. 

E< <A\-\B 



9-1 



> 



twip 



, +e-''-''(*<.(p)^*K,))(a(p)-6(g)) 



where $d$ is a distance, and $\nu$ a stiffness parameter.

Definition 4.8: 
^ AP B'i ->. ■ — 1 

E<! <A\-\B 



9-1 



> 



twip2 



_e-^.d(ta(p),tb(,))(a(p) .b{q)) 



6 • < /i-|^,_D2 ^tiuip2 

where $d$ is a distance, and $\nu$ a stiffness parameter.

We propose taking $\xi = 0$ for these two additive STWK.

4.2.4 Interpretation of additive STWK
For additive STWK each alignment path is assigned a cost that is the sum of the local costs attached to each edge of the path. We cannot maintain a probabilistic interpretation for local costs of the form $e^{-\nu' \cdot \Gamma(A(p) \to B(q))}$. Nevertheless, the additive STWK does not suffer from the diagonal dominance problem mentioned above and can be easily normalized. Furthermore, additive STWK can be viewed as a generalization of the standard inner product. In particular, note that for $\nu \to \infty$, twip2, when applied to a pair of time series of equal lengths and identically sampled, coincides with the inner product in Euclidean spaces.



(3) 



(4) 
Fig. 2. When considering the discrete time series A = (0,1)(2,2)(0,3)(0,4)(0,5)(3,6)(0,7)(0,8)(0,9) and B = (0,1)(0,2)(0,3)(2,4)(0,5)(0,6)(0,7)(3,8)(0,9), the Euclidean inner product is null, while, for $\nu = .1$, the twip1 inner product of A and B equals .459, and the twip2 inner product of A and B equals .475. For $\nu = 100$, both TWIP give a null value.



Let $U^N$ be the set of all time series whose length is $N$ and whose elements are selected in $S \times \{t_1 < t_2 < \cdots < t_N\}$, and consider the operators $\oplus$ and $\otimes$ defined below:

Definition 4.9: For all $A \in U^N$ and all $\lambda \in \mathbb{R}$, $C = \lambda \otimes A \in U^N$ is such that for all $0 \le i \le N$, $C(i) = (\lambda \cdot a(i), t_i)$, and thus $|C| = |A| = N$.

Definition 4.10: For all $(A, B) \in (U^N)^2$, $C = A \oplus B \in U^N$ is defined such that for all $0 \le i \le N$, $C(i) = (a(i) + b(i), t_i)$, and thus $|C| = |A| = |B| = N$.



Proposition 4.11: Definiteness and inner product structure of the additive STWK:

i) $\forall N \in \mathbb{N}^+$, $\langle \cdot, \cdot \rangle_{twip1}$ and $\langle \cdot, \cdot \rangle_{twip2}$ are positive definite on $U^N$.

ii) Furthermore, $\langle \cdot, \cdot \rangle_{twip1}$ and $\langle \cdot, \cdot \rangle_{twip2}$ are inner products on $(U^N, \oplus, \otimes)$, which we call Time Warp Inner Products (TWIP), where $\oplus$ and $\otimes$ are defined in Definition 4.10 and Definition 4.9 respectively.

iii) The Euclidean inner product $\langle \cdot, \cdot \rangle_E$ on a set of time series of constant lengths and uniformly sampled is the limit when $\nu \to \infty$ of $\langle \cdot, \cdot \rangle_{twip2}$ on this same set.



5 Classification experiments 

We empirically evaluate the effectiveness of some STWK in comparison with Gaussian Radial Basis Function (RBF) kernels or elastic distance substituting kernels [9], using classification tasks on a set of time series coming from quite different application fields. The classification task we have considered consists of assigning one of the possible categories to an unknown time series for the 20 data sets available at the UCR repository [13]. As time is not explicitly given for these datasets, we used the index value of the samples as the time stamps for the whole experiment.

For each dataset, a training subset (TRAIN) is defined 
as well as an independent testing subset (TEST). We use 
the training sets to train two kinds of classifiers: 

• the first one is a first near neighbor (1-NN) classifier: 
first we select a training data set containing time 
series for which the correct category is known. To 
assign a category to an unknown time series selected 
from a testing data set (different from the train set), 
we select its nearest neighbor (in the sense of a 
distance or similarity measure) within the training 
data set, then, assign the associated category to its 
nearest neighbor. For that experiment, a leave one 
out procedure is performed on the training dataset 
to optimize the meta parameters of the considered 
comparability measure. 

• the second one is a SVM classifier ||4|, Il29l 
configured with a Gaussian RBF kernel whose 
parameters are C > 0, a trade-off between 
regularization and constraint violation and a that 
determines the width of the Gaussian function. To 
determine the C and a hyper parameter values, 
we adopt a 5-folded cross-validation method on 
each training subset. According to this procedure, 
given a predefined training set TRAIN and a test 
set TEST, we adapt the meta parameters based on 
the training set TRAIN: we first divide TRAIN into 
5 stratified subsets TRAINi,TRAIN2, ,TRAIN^; 
then for each subset TRAINi we use it as a new 
test set, and regard {TRAIN - TRAINi) as a 
new training set; Based on the average error rate 
obtained on the five classification tasks, the optimal 
values of meta parameters are selected as the ones 
leading to the minimal average error rate. 

We have used the LIBSVM library [5J to implement 
the SVM classifiers. 
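The 1-NN protocol with leave-one-out selection of a meta parameter can be sketched as follows. This is only an illustrative sketch, not the code used for the experiments: the function names, the toy one-dimensional data and the family of distances indexed by `p` are assumptions made for the example.

```python
def one_nn_label(distances, labels, skip=None):
    """Label of the closest training series; `skip` excludes one index (leave-one-out)."""
    best = min((j for j in range(len(labels)) if j != skip), key=lambda j: distances[j])
    return labels[best]


def loo_error(train_dist, labels):
    """Leave-one-out 1-NN error rate from a square train-vs-train distance matrix."""
    n = len(labels)
    wrong = sum(one_nn_label(train_dist[i], labels, skip=i) != labels[i] for i in range(n))
    return wrong / n


def select_parameter(candidates, distance_matrix_for, labels):
    """Candidate with the smallest leave-one-out error (the first one wins on ties)."""
    return min(candidates, key=lambda p: loo_error(distance_matrix_for(p), labels))


# Toy data (hypothetical): four scalar "series" and a distance family indexed by p.
x = [0.0, 0.1, 1.0, 1.1]
labels = ["a", "a", "b", "b"]


def dist_matrix(p):
    return [[abs(u - v) ** p for v in x] for u in x]


best = select_parameter([0.5, 1.0, 2.0], dist_matrix, labels)
```

In the setting of the paper, `dist_matrix` would be replaced by the elastic distance under test, evaluated for each candidate value of its meta parameter.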



Note that any discrete time series space $U^N$ (a set of uniformly sampled time series of length N), when provided with the $\oplus$ and $\otimes$ operators and the metric (norm) induced by a TWIP, is a Hilbert space. The proof of proposition 4.11 is straightforward and is omitted.

TABLE 1
Dataset sizes and meta parameters used in conjunction with the $K_{ed}$ and $STWK_{twip_2}$ kernels

DATASET            length  #class  #train  #test   K_ed: C;σ      STWK_twip2: ν;C;σ
Synthetic control  60      6       300     300     1.0;0.125      0.1;2.0;0.0625
Gun-Point          150     2       50      150     256;1.0        1e-5;2048;2.0
CBF                128     3       30      900     8;1.0          0.01;1.0;0.0312
Face (all)         131     14      560     1690    4;0.5          1e-5;16.0;0.062
OSU Leaf           427     6       200     242     2;0.125        0.01;16;0.25
Swedish Leaf       128     15      500     625     128;0.125      1.0;16;0.125
50 Words           270     50      450     455     32;0.5         0.01;32;0.25
Trace              275     4       100     100     8;0.0156       1e-5;512;0.25
Two Patterns       128     4       1000    4000    4.0;0.25       0.01;2.0;0.0625
Wafer              152     2       1000    6174    4.0;0.5        0.1;8.0;0.0625
face (four)        350     4       24      88      8.0;2.0        1000;8.0;2.0
Lighting2          637     2       60      61      2.0;0.125      0.01;1.0;0.125
Lighting7          319     7       70      73      32.0;256.0     0.1;512.0;8.0
ECG                96      2       100     100     8.0;0.25       1000.0;8.0;0.25
Adiac              176     37      390     391     1024.0;0.125   1000;1024.0;0.125
Yoga               426     2       300     3000    64.0;0.125     1.0;16.0;0.0625
Fish               463     7       175     175     64.0;1.0       10.0;64.0;1.0
Coffee             286     2       28      28      128.0;4.0      1000.0;128.0;4.0
OliveOil           570     4       30      30      2.0;0.125      0.01;4.0;0.125
Beef               470     5       30      30      128.0;4.0      1000.0;128.0;4.0



TABLE 2
Meta parameters used in conjunction with the $\delta_{erp}$, $STWK_{erp}$, $\delta_{dtw}$, $STWK_{dtw}$, $\delta_{twed}$ and $STWK_{twed}$ kernels

DATASET       δ_erp: g;C;σ      STWK_erp: g;ν';C;σ       δ_dtw: C;σ    STWK_dtw: ν';C;σ     δ_twed: λ;ν;C;σ      STWK_twed: λ;ν;ν';C;σ
Synth. cont.  0.0;2.0;0.25      0.0;0.457;256.0;0.062    8.0;4.0       0.047;1024.0;0.062   0.75;0.01;1.0;0.25   0.75;0.01;0.685;8.0;4.0
Gun-Point     -0.35;4.0;0.031   -0.35;0.457;128.0;1.0    16.0;0.0312   0.457;64.0;2.0       0.0;0.001;8.0;1.0    0.0;0.001;0.685;32;32
CBF           -0.11;1.0;1.0     -0.11;0.203;4.0;32.0     1.0;1.0       0.457;2.0;1.0        1.0;0.001;1.0;1.0    1.0;0.00;,0.20;4.0;32.0
Face (all)    -1.96;4.0;0.5     -1.96;1.028;8.0;0.62     2.0;0.25      1.028;4.0;0.25       1.0;0.01;8.0;4.0     1.0;0.01;2.312;8.0;4.0
OSU Leaf      -2.25;2.0;0.062   -2.25;1.541;256;0.031    4.0;0.062     1.541;32.0;0.062     1.0;1e-4;8.0;0.25    1.0;1e-4;1.028;64.0;1.0
Swed. Leaf    0.3;8.0;0.125     0.3;0.203;1.0;4.0        4.0;0.031     5.202;0.062;0.5      1.0;1e-4;16.0;0.062  1.0;1e-4;0.304;32.0;1.0
50 Words      -1.39;16.0;0.25   -1.39;0.685;16;0.25      4.0;0.062     1.028;64.0;0.062     1.0;1e-3;8.0;0.5     1.0;1e-3;1.028;32.0;2.0
Trace         0.57;32;0.62      0.57;0.457;256;4.0       4;0.25        0.685;16;0.25        0.25;1e-3;8.0;0.25   0.25;1e-3;300;0.0625;0.25
Two Patt.     -0.89;0.25;0.125  -0.89;0.304;0.004;1.0    0.25;0.125    0.457;2.0;0.125      1.0;1e-3;0.25;0.125  1.0;1e-3;0.685;0.25;0.125
Wafer         1.23;2.0;0.062    1.23;0.685;4.0;0.5       1.0;0.016     1.541;1024;0.031     1.0;0.125;4.0;0.62   1.0;0.125;1.541;1.0;4.0
face (four)   1.97;64;16        1.97;0.685;32;2          16;0.5        0.457;16;2           1.0;0.01;4;2         1.0;0.01;1.027;4;2
Lighting2     -0.33;2;0.062     -0.33;2.312;128;0.062    2.0;0.031     1.541;32;0.062       0.0;1e-6;8;0.25      0.0;1e-6;1.541;8;8
Lighting7     -0.40;128;2       -0.40;0.685;32;0.25      4;0.25        0.685;32;0.062       0.25;0.1;4;0.5       0.25;0.1;0.685;4;8
ECG           1.75;8;0.125      1.75;0.457;16;0.5        2;0.62        1.028;32;0.062       0.5;1.0;4;0.125      0.5;1.0;5.202;8;16
Adiac         1.83;16;0.0156    1.83;2.312;4096;0.031    16;0.0039     1.028;2048;0.031     0.75;1e-4;16;0.016   0.75;1e-4;2.312;128;1
Yoga          0.77;4;0.031      0.77;11.705;4096;0.031   4;0.008       26.337;1024;0.031    0.5;1e-5;2;0.125     0.5;1e-5;3.468;256;2
Fish          -0.82;64;0.25     -0.82;0.685;32;0.5       8;0.016       3.468;64;16          0.5;1e-4;4;.5        0.5;1e-4;0.457;16;16
Coffee        -3.00;16;0.062    -3.00;26.337;4096;16     8;0.062       5.202;512;4          0;0.1;16;4           0;0.1;300;1024;128
OliveOil      -3.00;8;0.5       -0.82;0.457;256;0.062    2;0.125       0.457;32;0.125       0;0.001;256;32       0;0.001;32;32
Beef          -3.00;128;0.125   -3.00;0.685;0.004;16384  16;0.016      0.457;0.004;16       0;1e-4;2;1           0;1e-4;0.135;0.004;16

TABLE 3

Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and an SVM classifier for the $K_{ed}$ and $STWK_{twip_2}$ kernels. Two scores are given, S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data.

DATASET                 1-NN ED       1-NN STWK_twip2   SVM K_ed      SVM STWK_twip2
Synthetic control       8.7|12        .33|.67           2.33|2        1.67|.33
Gun-Point               4.08|8.67     4.08|8.67         6|6           2|3.33
CBF                     17.24|14.8    6.9|2.33          3.3|10.89     3.33|3.44
Face (all)              11.27|28.64   6.98|24.37        6.07|16.63    4.29|23.79
OSU Leaf                38.19|48.34   33.17|45.04       32|43.80      29.5|44.21
Swedish Leaf            25.05|21.12   23.25|20.19       23.25|20.19   12.8|8.8
50 Words                36.52|36.92   33.18|34.28       32.45|30.10   31.11|29.67
Trace                   16.16|24      11.11|23          7|24          3|10
Two Patterns            8.61|9.32     4.8|4.17          8.6|7.45      5.5|4.1
Wafer                   0.7|0.45      .4|.67            .8|.52        .2|0.82
face (four)             34.78|21.59   34.78|21.59       25|14.77      20.83|22.73
Lighting2               28.81|24.59   22.03|16.39       21.77|32.79   20|21.31
Lighting7               36.23|42.47   28.98|31.51       37.14|36.98   30|36.98
ECG                     14.14|12      14.14|12          8|9           8|9
Adiac                   39.59|38.87   39.59|38.87       25.13|25.83   25.13|24.29
Yoga                    23.08|16.97   20|16.83          16|14.87      15.33|14.7
Fish                    24.14|21.71   24.14|21.71       13.14|13.14   13.14|12.57
Coffee                  22.22|25      18.52|17.85       0|0           0|3.57
OliveOil                13.79|13.33   13.33|13.33       10|13.33      6.37|13.33
Beef                    51.72|46.67   48.28|50          30|30         30|30
# Best Scores           4|8           20|18             4|9           20|15
# Uniquely Best Scores  0|2           15|12             0|5           15|11



[Plot: 1-NN $\delta_{ed}$ vs. 1-NN $\delta_{twip_2}$]

Fig. 3. Comparison of error rates (in %) between two 1-NN classifiers based on the Euclidean distance $\delta_{ed}$ (1-NN ED) and on the distance $\delta_{twip_2}$ induced by a time-warp inner product (1-NN TWIP). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that distance $\delta_{twip_2}$ has a lower (resp. higher) error rate than distance $\delta_{ed}$.

[Plot: SVM $K_{ed}$ vs. SVM $STWK_{twip_2}$]

Fig. 4. Comparison of error rates (in %) between two SVM classifiers, the first one based on the Euclidean distance Gaussian kernel (SVM $K_{ed}$), and the second one based on a Gaussian kernel induced by a time-warp inner product (SVM $STWK_{twip_2}$). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM $STWK_{twip_2}$ has a lower (resp. higher) error rate than SVM $K_{ed}$.



TABLE 4 
Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor 
classification rule and a SVM classifier for the erp, STWKerp, dtw and STWKdtw kernels. Two scores are given 
S1 |S2: the first one, S1 , is evaluated on the training data, while the second one, S2, is evaluated on the test data. 



DATASET 


1-NN 5erp 


SVM Serp 


SVM STWKerp 


1-NN 5dt» 


SVM Sat^ 


SVM STWKaty, 


Synthetic control 


0.67|3.7 


1.33 


.33|1.33 


1.0|0.67 


0|2.33 


0|1 


Gun-Point 


6.12|4 


1.33 


11.33 


18.36|9.3 


8|10 


11.33 


CBF 


0|0.33 


3.33|3.56 


3.33 1 3.22 


0|0.33 


3.33 1 1.44 


3.33|5.44 


Face (all) 


10.73|20.18 


.89|18.1 


.54 116.86 


6.8|19.23 


4.47|15.32 


.54|16.98 


OSU Leaf 


30.15|40.08 


25|35.95 


11.5|30.57 


33.17 40.9 


29.5|43.8 


20|23.55 


Swedish Leaf 


11.02|12 


9.2|7.36 


7 16.24 


24.65 20.8 


21.8|18.56 


7|5.6 


50 Words 


19.38|28.13 


24.98 


24.61 


16.32 


16.04 


33.18|31 


31.66 


29.45 


15.21 


17.58 


Trace 


10.01|17 





1 





1 

















2 


Two Patterns 


0|0 
































Wafer 


.10.9 


.1 0.89 


0|0.44 


1.4|2.01 


2|2.95 


0|0.39 


face (four) 


4.35 


10.2 


8.33|4.55 


4.17|3.4 


26.09|17.05 


12.5 


12.5 


8.33 1 5.68 


Ligthing2 


11.86 


14.75 


10|18.03 


11.67 


19.67 


13.56 


13.1 


18.33 


24.59 


8.33|19.67 


Ligthing7 


23.19|30.1 


18.57|16.43 


18.57 


17.81 


33.33 


27.4 


27.15 


21.91 


17.14 116.43 


ECG 


10.01|13 


15|9 


9|13 


23.231 23 


12 


17 


7|13 


Adiac 


35.99|37.85 


29.74|30.94 


25.74|24.04 


40.62|39.64 


38.46 


34.52 


24.61 125.32 


Yoga 


14.051 14.7 


14|12.1 


11|11.47 


16.37|16.4 


19.33 


16.87 


11|11.2 


Fish 


16.09 12 


9.71|9.71 


6.86 4.57 


26.44 16.57 


21.72 


19.43 


6.86 4.57 


Coffee 


25.93 25 


7.14| 17.85 


10.71 14.29 


14.81 17.86 


10.71|7.14 


10.71 17.86 


OliveOil 


17.24|16.67 


13.33 16.67 


13.33 16.67 


13.79 13.33 


10 116.67 


13.33 16.67 


Beef 


68.97|50 


36.67 46.67 


43.33|50 


55.17|50 


36.67|50 


32.14 42.85 


# Best Scores 


- 


10/9 


16/16 


- 


6/6 


19/16 


# Uniquely Best Scores 


- 


4/4 


10/11 


- 


1/4 


14/14

Fig. 5. Comparison of error rates (in %) between two SVM classifiers, the first one based on the $\delta_{erp}$ substituting kernel (SVM $\delta_{erp}$), and the second one based on an additive time-warp kernel induced by the ERP distance (SVM $STWK_{erp}$). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM $STWK_{erp}$ has a lower (resp. higher) error rate than SVM $\delta_{erp}$.

Fig. 6. Comparison of error rates (in %) between two SVM classifiers, the first one based on the $\delta_{dtw}$ substituting kernel (SVM $\delta_{dtw}$), and the second one based on an additive time-warp kernel induced by the $\delta_{dtw}$ distance (SVM $STWK_{dtw}$). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM $STWK_{dtw}$ has a lower (resp. higher) error rate than SVM $\delta_{dtw}$.
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TABLE 5 

Comparative study using the UCR datasets: classification error rates (in %) obtained using the first near neighbor classification rule and an SVM classifier for the $\delta_{twed}$ and $STWK_{twed}$ kernels. Two scores are given, S1|S2: the first one, S1, is evaluated on the training data, while the second one, S2, is evaluated on the test data.



DATASET 


1-NN 5t„,d 


SVM 5,„,d 


SVM STWKt^.a 


Synthetic control 


12.33 


1.33 


0|1.33 


Gun-Point 


1.33 


0.67 


0|0 


CBF 


0.67 


3.33|3.67 


2.44|1.33 


Face (all) 


1.431 18.93 


0.56|16.86 


0.72|15.09 


OSU Leaf 


17.59|24.79 


15|18.18 


19|22.73 


Swedish Leaf 


8.82| 10.24 


7.2 


6.4 


6.6 15.12 


50 Words 


18.26|18.9 


15.66 


14.51 


15.12 


16.26 


Trace 


1|5 











1 


Two Patterns 


0|0.12 


0|0.025 








Wafer 


.1 


.86 


0.1|0.41 


0.1 1 0.37 


face (four) 


8.7|3.41 


8.33|2.27 


8.33|3.4 


Ligthing2 


13.56 21.31 


15|21.31 


11.67 21.31 


Ligthing7 


24.64 24.66 


25.29 1 23.29 


25.29 23.29 


ECG 


13.13|10 


12|8 


12|7 


Adiac 


36.25|37.6 


30.51|31.02 


24.87 23.53 


Yoga 


19.06|12.97 


12|9.9 


12.33 10.83 


Fish 


12.07|5.14 


4.57|2.86 


6.29 3.43 


Coffee 


18.52 21.43 


25 28.5 


10.61 17.86 


OliveOil 


11.11 16.67 


13.33 113.33 


13.33 13.33 


Beef 


58.62|53.3 


36.67|53.33 


46.671 50 


# Best Scores 


- 


13|10 


15|14 


# Uniquely Best Scores 


- 


5|6 


610 



[Plot: SVM $\delta_{twed}$ vs. SVM $STWK_{twed}$]

Fig. 7. Comparison of error rates (in %) between two SVM classifiers, the first one based on the $\delta_{twed}$ substituting kernel (SVM $\delta_{twed}$), and the second one based on an additive time-warp kernel induced by the TWED distance (SVM $STWK_{twed}$). The straight line has a slope of 1.0 and dots correspond, for the pair of classifiers, to the error rates on the train (star) or test (circle) data sets. A dot below (resp. above) the straight line indicates that SVM $STWK_{twed}$ has a lower (resp. higher) error rate than SVM $\delta_{twed}$.

5.1 Additive STWK

We tested the additive STWK based on the Time Warp Inner Product $\langle A,B \rangle_{twip_2}$ (Eq. 4); we chose to test $twip_2$ because for large $\nu$ it tends towards the Euclidean inner product. Precisely, we used the time-warp distance induced by $\langle A,B \rangle_{twip_2}$, namely

$$\delta_{twip_2}(A,B) = \left( \langle A-B, A-B \rangle_{twip_2} \right)^{1/2}$$

5.1.1 Meta parameters

$\delta_{twip_2}$ is characterized by the meta parameter $\nu$ (the stiffness parameter), which is optimized for each dataset on the train data by minimizing the classification error rate of a first near neighbor classifier. For this kernel, $\nu$ is selected in {100, 10, 1, .1, .01, ..., 1e-5}.

To explore the potential benefits of TWIP against the Euclidean inner product, we also tested the Euclidean distance $\delta_{ed}$, which is the limit when $\nu \to \infty$ of $\delta_{twip_2}$.

The kernels exploited by the SVM classifiers are the Gaussian kernels

$$STWK_{twip_2}(A,B) = e^{-\delta_{twip_2}(A,B)^2/(2\sigma^2)} \quad \text{and} \quad K_{ed}(A,B) = e^{-\delta_{ed}(A,B)^2/(2\sigma^2)}$$

The meta parameter C is selected from the discrete set $\{2^{-5}, 2^{-4}, ..., 1, 2, ..., 2^{10}\}$, and $\sigma^2$ from $\{2^{-5}, 2^{-4}, ..., 1, 2, ..., 2^{10}\}$.

Table 1 gives, for each data set and each tested kernel ($K_{ed}$ and $STWK_{twip_2}$), the corresponding optimized values of the meta parameters.
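As an illustration of how a distance is turned into the Gaussian kernels above, the following minimal sketch builds a kernel matrix from a precomputed distance matrix and spells out the two search grids as stated in the text. The function name and the toy distance matrix are assumptions made for the example; the distance itself ($\delta_{twip_2}$ or $\delta_{ed}$) is supposed to be computed elsewhere.

```python
import math


def gaussian_kernel_matrix(dist, sigma2):
    """K[i][j] = exp(-dist[i][j]**2 / (2 * sigma2)) for a precomputed distance matrix."""
    return [[math.exp(-(d * d) / (2.0 * sigma2)) for d in row] for row in dist]


# Search grids as stated in the text: powers of two from 2^-5 to 2^10.
C_GRID = [2.0 ** e for e in range(-5, 11)]
SIGMA2_GRID = [2.0 ** e for e in range(-5, 11)]

# Toy distance matrix between two series (hypothetical values).
toy_dist = [[0.0, 2.0], [2.0, 0.0]]
K = gaussian_kernel_matrix(toy_dist, sigma2=2.0)
```

Such a matrix can then be handed to any SVM implementation that accepts precomputed kernels, one (C, σ²) pair of the grid at a time.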



5.2 Multiplicative STWK

We tested the multiplicative exponentiated STWK based on the $\delta_{erp}$, $\delta_{dtw}$ and $\delta_{twed}$ distance costs. We consider respectively the positive definite $STWK_{erp}$, $STWK_{dtw}$ and $STWK_{twed}$ kernels. Our experiment compares classification errors on the test data for:

• the first near neighbor classifiers based on the $\delta_{erp}$, $\delta_{dtw}$ and $\delta_{twed}$ distance measures (1-NN $\delta_{erp}$, 1-NN $\delta_{dtw}$ and 1-NN $\delta_{twed}$),

• the SVM classifiers using Gaussian distance substituting kernels based on the same distances and their corresponding STWK, i.e. SVM $\delta_{erp}$, SVM $STWK_{erp}$, SVM $\delta_{dtw}$, SVM $STWK_{dtw}$, SVM $\delta_{twed}$ and SVM $STWK_{twed}$.

For $\delta_{erp}$, $\delta_{twed}$, $STWK_{erp}$ and $STWK_{twed}$ we used the L1-norm, while the L2-norm has been implemented for $\delta_{dtw}$ and $STWK_{dtw}$, a classical choice for DTW [20].



5.2.1 Meta parameters

For the $\delta_{erp}$ kernel, meta parameter g is optimized for each dataset on the train data by minimizing the classification error rate of a first near neighbor classifier using a Leave One Out (LOO) procedure. For this kernel, g is selected in {-3, -2.99, -2.98, ..., 2.98, 2.99, 3}. This optimized value is also used for comparison with the $STWK_{erp}$ kernel.

For the $\delta_{twed}$ kernel, meta parameters $\lambda$ and $\nu$ are optimized for each dataset on the train data by minimizing the classification error rate of a first near neighbor classifier. For our experiment, the stiffness value ($\nu$) is selected from $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$ and $\lambda$ is selected from {0, .25, .5, .75, 1.0}. If different $(\nu, \lambda)$ values lead to the minimal error rate estimated for the training data, then the pairs containing the highest $\nu$ value are selected first, then the pair with the highest $\lambda$ value is finally selected. These optimized $(\lambda, \nu)$ values are also used for comparability purposes with the $STWK_{twed}$ kernel.
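The tie-breaking rule described for the TWED meta parameters amounts to a lexicographic selection. A minimal sketch is given below; the function name and the error values are hypothetical and only serve to show the ordering (lowest error first, then highest $\nu$, then highest $\lambda$).

```python
def select_twed_parameters(errors):
    """`errors` maps (nu, lam) to a leave-one-out error rate.

    The pair with the lowest error is returned; ties are broken by the
    highest nu, then by the highest lam.
    """
    return min(errors, key=lambda p: (errors[p], -p[0], -p[1]))


# Hypothetical error rates for four (nu, lam) pairs.
toy_errors = {
    (0.01, 0.0): 0.10,
    (0.1, 0.5): 0.10,
    (0.1, 1.0): 0.10,
    (1.0, 1.0): 0.20,
}
chosen = select_twed_parameters(toy_errors)
```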

The kernels exploited by the SVM classifiers are the Gaussian Radial Basis Function (RBF) kernels $K(A,B) = e^{-\delta(A,B)^2/(2\sigma^2)}$, where $\delta$ stands for $\delta_{erp}$, $\delta_{dtw}$, $\delta_{twed}$, $STWK_{erp}$, $STWK_{dtw}$, $STWK_{twed}$. Meta parameter C is selected from $\{2^{-5}, 2^{-4}, ..., 1, 2, ..., 2^{10}\}$, and $\sigma^2$ from $\{2^{-5}, 2^{-4}, ..., 1, 2, ..., 2^{10}\}$. The best values are obtained using a cross validation procedure.

For the $STWK_{erp}$, $STWK_{dtw}$ and $STWK_{twed}$ kernels, meta parameter $1/\nu'$ is selected from the discrete set $S = \{10^{-5}, 10^{-4}, ..., 1, 10, 100\}$.

The optimization procedure is as follows:

• for each value in S, we train an SVM STWK classifier on the training dataset using the previously described 5-fold cross validation procedure to select the SVM meta parameters C and $\sigma$, and the average of the classification error is recorded.

• the best $\sigma$, C and $\nu'$ values are the ones that lead to the minimal average error.

Table 2 gives, for each data set and each tested kernel ($\delta_{erp}$, $\delta_{dtw}$, $\delta_{twed}$, $STWK_{erp}$, $STWK_{dtw}$ and $STWK_{twed}$), the corresponding optimized values of the meta parameters.



5.3 Discussion

5.3.1 Additive STWK experiment analysis

Table 3 shows the classification error rates obtained for the tested methods, i.e. the first near neighbor classifier based on the Euclidean distance and on the distance induced by the time-warp inner product (1-NN ED and 1-NN $\delta_{twip_2}$), and the Gaussian RBF kernel SVM based on the Euclidean distance and on the distance induced by the time-warp inner product (SVM $K_{ed}$ and SVM $STWK_{twip_2}$).

This experiment shows that the time-warp inner product is significantly more effective for the considered tasks than the Euclidean distance, since it exhibits, on average, the lowest error rates for most of the tested datasets for both the 1-NN and SVM classifiers, as shown in Table 3 and Figures 3 and 4. The stiffness parameter in $\delta_{twip_2}$ seems to play a significant role in these classification tasks, and this for a quite large majority of data sets.

TABLE 6
Analysis of the deviation from conditional definiteness for the Gram matrices associated with the $\delta_{dtw}$, $\delta_{erp}$ and $\delta_{twed}$ distances. We report for each dataset the number of positive eigenvalues (#Pev) relative to the total number of eigenvalues (#Ev) and the deviation from definiteness estimated as $\Delta_p$, expressed in %. The expectation is a single positive eigenvalue, #Pev = 1, corresponding to $\Delta_p$ = 0%.

DATASET            δ_dtw                 δ_erp                 δ_twed
                   #Pev/#Ev   Δp         #Pev/#Ev   Δp         #Pev/#Ev   Δp
Synthetic control  110/300    15.66%     8/300      .16%       6/300      .22%
Gun-Point          23/50      2.54%      1/50       0%         1/50       0%
CBF                5/30       3.36%      1/30       0%         1/30       0%
Face (all)         242/560    26.6%      83/560     2.42%      41/560     1.89%
OSU Leaf           96/200     31.79%     29/200     2.97%      16/200     .89%
Swedish Leaf       206/500    17.04%     24/500     .68%       23/500     .41%
50 Words           218/450    34.03%     119/450    9.54%      93/450     4.85%
Trace              43/100     5.42%      1/100      0%         1/100      0%
Two Patterns       453/1000   36.7%      259/1000   13.8%      226/1000   9.85%
Wafer              497/1000   14.84%     137/1000   1.29%      39/1000    .04%
face (four)        2/24       .74%       1/24       0%         1/24       0%
Lighting2          20/60      13.44%     1/60       0%         1/60       0%
Lighting7          24/70      14.25%     1/70       0%         1/70       0%
ECG                38/100     14.7%      1/100      0%         1/100      0%
Adiac              159/390    5.54%      26/390     .82%       39/390     .69%
Yoga               142/300    23.4%      29/300     3.17%      10/300     .41%
Fish               71/175     17.57%     1/175      0%         1/175      0%
Coffee             12/28      8.83%      1/28       0%         1/28       0%
OliveOil           4/30       .24%       1/30       0%         1/30       0%
Beef               15/30      6.17%      1/30       0%         1/30       0%

Only one dataset, Beef, is better classified by the 1-NN ED classifier on the test data, although the error rate on the train data is lower for the 1-NN $\delta_{twip_2}$ classifier. As the train data for the Beef dataset is quite small (30 instances), the significance of this specific result is not clear. For the SVM classifiers, only two datasets, Face (all) and Face (four), are significantly better classified on the test data by the SVM $K_{ed}$ classifier. Nevertheless, for these two datasets, $STWK_{twip_2}$ reaches a better score on the train data. We are facing here the trade-off between learning and generalization capabilities. The meta parameter $\nu$ is selected so as to minimize the classification error on the train data. If this strategy is on average a winning strategy, some datasets show that it does not necessarily lead to a good trade-off; this is the case for the Face (all) and Face (four) datasets.

5.3.2 Multiplicative STWK experiment analysis

Tables 4 and 5 show the classification error rates obtained for the tested methods, i.e. the first near neighbor classifier based on the $\delta_{erp}$, $\delta_{dtw}$ and $\delta_{twed}$ distances (1-NN $\delta_{erp}$, 1-NN $\delta_{dtw}$ and 1-NN $\delta_{twed}$), the Gaussian RBF kernel SVM based on the same distances (SVM $\delta_{erp}$, SVM $\delta_{dtw}$ and SVM $\delta_{twed}$) and the Gaussian RBF kernel SVM based on the STWK kernels (SVM $STWK_{erp}$, SVM $STWK_{dtw}$ and SVM $STWK_{twed}$).

In this experiment, we show that the SVM classifiers clearly outperform the 1-NN classifiers. But the interesting result reported in Tables 4 and 5 and Figures 5, 6 and 7 is that SVM $STWK_{erp}$ and SVM $STWK_{twed}$ perform slightly better than SVM $\delta_{erp}$ and SVM $\delta_{twed}$ respectively, and that SVM $STWK_{dtw}$ is clearly much more efficient than SVM $\delta_{dtw}$. This could come from the fact that $\delta_{erp}$ and $\delta_{twed}$ are metrics, whereas $\delta_{dtw}$ is not. SVM $\delta_{dtw}$ behaves poorly compared to the other tested classifiers, probably because the SVM optimization process does not perform well. Nevertheless, the $STWK_{dtw}$ kernel based on $\delta_{dtw}$ seems to largely correct this drawback.

To explore further the potential impact of indefiniteness on classification error rates, we give in Table 6 two quantified hints of deviation from conditional definiteness for the Gram matrices corresponding to the $\delta_{dtw}$, $\delta_{erp}$ and $\delta_{twed}$ distances. Since a conditionally (negative) definite Gram matrix is characterized by a single positive eigenvalue, the first hint is the number of positive eigenvalues #Pev (we also give as a reference the total number of eigenvalues, #Ev). The second hint,

$$\Delta_p = 100 \cdot \frac{\sum_{ev_i > 0} ev_i - \max_i(ev_i)}{\sum_{ev_i > 0} ev_i}$$

where $ev_i$ is an eigenvalue of the Gram matrix, quantifies the weight of the extra positive eigenvalues relative to the weight of all the positive eigenvalues. Therefore, a conditionally (negative) definite Gram matrix should be such that simultaneously #Pev = 1 and $\Delta_p$ = 0. By examining the Gram matrices corresponding to each training dataset and to each of the distances $\delta_{dtw}(A,B)$, $\delta_{erp}(A,B)$ and $\delta_{twed}(A,B)$, we can show that the $\delta_{dtw}$ kernel is much farther away from a conditionally definite matrix than the $\delta_{erp}$ and $\delta_{twed}$ kernels. The distance that is closest to conditional definiteness is the $\delta_{twed}$ distance. This is clearly measurable by the number of positive eigenvalues and their amplitudes. Furthermore, for datasets of small sizes (such as CBF, Beef, Coffee, OliveOil, etc.), the $\delta_{erp}$ and $\delta_{twed}$ kernels produce conditionally definite Gram matrices when $\delta_{dtw}$ does not. The regularization brought by STWK is therefore more effective on $\delta_{dtw}$. This is the case for instance on the Beef dataset for which, on the train data, $\delta_{twed}$ performs slightly better than $STWK_{twed}$. In this case, both kernels lead to a definite Gram matrix, and the extra parameter $\nu'$ in use in the $STWK_{twed}$ kernel probably explains a poorer classification rate due to a lack of learning data. Nevertheless, similarly to the additive STWK, few datasets are better classified by SVMs that directly use the distance kernel instead of the derived STWK kernel. The same reasons mentioned above in the case of additive STWK can be invoked here also. The extra parameter $\nu'$ makes the search for an optimal setting on the train data more difficult and requires more learning data to converge. The trade-off between learning and generalization is therefore even more complex.
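The two hints reported in Table 6 can be computed from the eigenvalues of a symmetric Gram matrix. The sketch below is illustrative only: it assumes that $\Delta_p$ is the share (in %) of the positive spectrum that is not carried by the largest eigenvalue, which is one reading of the description given above, and it uses NumPy's `numpy.linalg.eigvalsh`. The function name and the tolerance `tol` are assumptions.

```python
import numpy as np


def definiteness_hints(gram, tol=1e-9):
    """Return (#Pev, Delta_p) for a symmetric matrix.

    #Pev counts eigenvalues larger than `tol`; Delta_p is, in %, the weight of
    the positive eigenvalues other than the largest one (assumed reading).
    """
    eig = np.linalg.eigvalsh(np.asarray(gram, dtype=float))
    pos = eig[eig > tol]
    if pos.size == 0:
        return 0, 0.0
    delta_p = 100.0 * (pos.sum() - pos.max()) / pos.sum()
    return int(pos.size), float(delta_p)


# A 2 x 2 distance matrix has eigenvalues +1 and -1: a single positive eigenvalue.
n_pos, delta_p = definiteness_hints([[0.0, 1.0], [1.0, 0.0]])
```

With a single positive eigenvalue the function returns #Pev = 1 and $\Delta_p$ = 0, which is the expectation stated for a conditionally (negative) definite Gram matrix.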



6 Conclusion

Following the work on convolution kernels [11] and local alignment kernels defined for string processing around the Smith and Waterman algorithm [27], [21], we propose summative time-warp kernels (STWK) applicable for string and time series processing. We give some simple sufficient conditions to build positive definite STWK. Our generalization leads us to propose additive and multiplicative STWK. For multiplicative STWK, we show that, for the exponentiated version we have tested, the sufficient conditions are basically satisfied by classical elastic distances defined by a recursive equation. In particular this is true for the edit distance, the well known Dynamic Time Warping measure and for some variants such as the Edit Distance With Real penalty and the Time Warp Edit Distance, the latter two being metrics as well as the symbolic edit distance. From the general additive STWK definition we have suggested a time-warp inner product (TWIP) from which a metric (or norm) that generalizes the Euclidean distance (or Euclidean norm) is induced. The experiments conducted on a variety of time series datasets show that the multiplicative positive definite STWKs outperform the indefinite elastic distances they are derived from when considering 1-NN and SVM classification tasks, specifically when the Gram matrix associated with the elastic distance is far from definiteness.

Our experiments also show that the additive STWK we constructed from the proposed instance of TWIP ($twip_2$) significantly outperforms the kernels derived from the Euclidean inner product. This time-warp inner product opens some interesting perspectives since it leads to reconsidering the notion of orthogonality in discrete time series spaces. In particular, in such spaces provided with a TWIP the discrete sine and cosine waveforms are no longer orthogonal. Is it therefore relevant to raise the issue of a discrete elastic Fourier transform?

Appendix A

INDEFINITENESS OF CLASSICAL ELASTIC MEASURES

A.1 The Levenshtein distance

The Levenshtein distance kernel $\varphi(x,y) = \delta_{lev}(x,y)$ is known to be indefinite. Below, we discuss the first known counter-example produced by [7]. Let us consider the subset of sequences V = {abc, bad, dab, adc, bcd} that leads to the following distance matrix

$$M^V_{lev} = \begin{pmatrix} 0 & 3 & 2 & 1 & 2 \\ 3 & 0 & 2 & 2 & 1 \\ 2 & 2 & 0 & 3 & 3 \\ 1 & 2 & 3 & 0 & 3 \\ 2 & 1 & 3 & 3 & 0 \end{pmatrix} \qquad (5)$$

and consider coefficient vectors C and D in $\mathbb{R}^5$ such that

C = [1, 1, -2/3, -2/3, -2/3] with $\sum_{i=1}^{5} c_i = 0$ and
D = [1/3, 2/3, 1/3, -2/3, -2/3] with $\sum_{i=1}^{5} d_i = 0$.

Clearly $C \cdot M^V_{lev} \cdot C^T = 2/3 > 0$ and $D \cdot M^V_{lev} \cdot D^T = -4/3 < 0$, showing that $M^V_{lev}$ has no definiteness.
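The Levenshtein counter-example can be checked numerically. The following sketch recomputes the distance matrix from the five strings and evaluates the two quadratic forms with exact rational arithmetic. It assumes unit costs for insertion, deletion and substitution, and reads the fifth string as bcd, which is the reading consistent with the printed matrix; the helper names are illustrative.

```python
from fractions import Fraction as F


def levenshtein(s, t):
    """Classical edit distance with unit costs (dynamic programming, one row at a time)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def quad_form(v, m):
    """Value of v . m . v^T for a vector v and a square matrix m."""
    n = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(n) for j in range(n))


V = ["abc", "bad", "dab", "adc", "bcd"]
M = [[levenshtein(s, t) for t in V] for s in V]
C = [F(1), F(1), F(-2, 3), F(-2, 3), F(-2, 3)]
D = [F(1, 3), F(2, 3), F(1, 3), F(-2, 3), F(-2, 3)]

print(quad_form(C, M), quad_form(D, M))  # 2/3 -4/3
```

Both coefficient vectors sum to zero, and the two quadratic forms have opposite signs, which is exactly what rules out conditional definiteness of either sign.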



A.2 The Dynamic Time Warping distance

The DTW kernel $\varphi(x,y) = \delta_{dtw}(x,y)$ is also known not to be conditionally definite. The following example demonstrates this known result. Let us consider the subset of sequences V = {01, 012, 0123, 01234}.

Then the DTW empirical Gram matrix evaluated on V is

$$M^V_{dtw} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 0 & 0 \\ 2 & 0 & 0 & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \qquad (6)$$

and consider coefficient vectors C and D in $\mathbb{R}^4$ such that

C = [1/4, -3/8, -1/8, 1/4] with $\sum_{i=1}^{4} c_i = 0$ and
D = [-1/4, -1/4, 1/4, 1/4] with $\sum_{i=1}^{4} d_i = 0$.

Clearly $C \cdot M^V_{dtw} \cdot C^T = 2/32 > 0$ and $D \cdot M^V_{dtw} \cdot D^T = -1/2 < 0$, showing that $M^V_{dtw}$ has no definiteness.



A.3 The Time Warp Edit Distance

Similarly, it is easy to find simple counter-examples that show that TWED kernels are not definite.

Let us consider the subset of sequences V = {010, 012, 103, 301, 032, 123, 023, 003, 302, 321}.

For the TWED metric, with $\nu = 1.0$ and $\lambda = 0.0$ we get the following matrix:

$$M^V_{twed} = \begin{pmatrix}
0 & 2 & 7 & 9 & 6 & 7 & 5 & 5 & 10 & 9 \\
2 & 0 & 5 & 9 & 4 & 5 & 3 & 3 & 8 & 9 \\
7 & 5 & 0 & 6 & 7 & 4 & 6 & 2 & 5 & 10 \\
9 & 9 & 6 & 0 & 13 & 10 & 12 & 8 & 1 & 4 \\
6 & 4 & 7 & 13 & 0 & 5 & 3 & 5 & 12 & 9 \\
7 & 5 & 4 & 10 & 5 & 0 & 2 & 6 & 9 & 6 \\
5 & 3 & 6 & 12 & 3 & 2 & 0 & 4 & 11 & 8 \\
5 & 3 & 2 & 8 & 5 & 6 & 4 & 0 & 7 & 10 \\
10 & 8 & 5 & 1 & 12 & 9 & 11 & 7 & 0 & 5 \\
9 & 9 & 10 & 4 & 9 & 6 & 8 & 10 & 5 & 0
\end{pmatrix} \qquad (7)$$

The eigenvalue spectrum for this matrix is the following: {4.62, 0.04, -2.14, -0.98, -0.72, -0.37, -0.19, -0.17, -0.06, -0.03}. This spectrum contains 2 strictly positive eigenvalues, showing that $M^V_{twed}$ has no definiteness.



A.4 The Edit Distance with Real Penalty

For the ERP metric, with g = 0.0 we get the following matrix:

$$M^V_{erp} = \begin{pmatrix}
0 & 2 & 3 & 3 & 4 & 5 & 4 & 2 & 4 & 5 \\
2 & 0 & 3 & 5 & 2 & 3 & 2 & 2 & 4 & 5 \\
3 & 3 & 0 & 4 & 3 & 2 & 3 & 1 & 3 & 4 \\
3 & 5 & 4 & 0 & 7 & 6 & 7 & 5 & 1 & 2 \\
4 & 2 & 3 & 7 & 0 & 3 & 2 & 2 & 6 & 5 \\
5 & 3 & 2 & 6 & 3 & 0 & 1 & 3 & 5 & 4 \\
4 & 2 & 3 & 7 & 2 & 1 & 0 & 2 & 6 & 5 \\
2 & 2 & 1 & 5 & 2 & 3 & 2 & 0 & 4 & 5 \\
4 & 4 & 3 & 1 & 6 & 5 & 6 & 4 & 0 & 1 \\
5 & 5 & 4 & 2 & 5 & 4 & 5 & 5 & 1 & 0
\end{pmatrix} \qquad (8)$$

The eigenvalue spectrum for this matrix is the following: {4.63, 0.02, 1.39e-17, -2.21, -0.97, -0.56, -0.41, -0.26, -0.17, -0.08}. This spectrum contains 3 strictly positive eigenvalues (although the third positive eigenvalue, which is very small, could be the result of the imprecision of the diagonalization algorithm used), showing that $M^V_{erp}$ has no definiteness.



Appendix B 

Proofs of main propositions 

B.I Proof of theorem lOl 

i) Let us show that if the function: 

f(x,y) = /(r(x -^ y)) : {{S xT)U {A}f ^ M is positive 
definite and if ^ > , then an additive or multiplicative 
STWK is definite positive. 

Let us first denote < A,B >ij=< A\,B{ > the 
restriction of the STWK up to index i and j such that 
0<i< |A| and 0< j < \B\. 



Furthermore let ka(i)-+s(j)(^7^)/ ^iK^B{j)[A,B) and 
K^(i)^A(^i B) be local kernels defined on U^ as follows: 

. y{A,B) e U2, KA^Ci^BU){A,B) = .f{T{A(i) ^ i?(j))) 
if < i < 1^1 and < j < \B\, k,^j{A,B) = 
otherwise. 

. V(A,B) e U2, K^^BU){A,B) = /(r(A ^ B{j))) if 
< j < \B\, KA.j{A, B) = otherwise. 

. y{A,B) e U2, k^(,)^a(A,B) - f{r{A{i) ^ A)) if 
< j < 1^1, Ki,A(A, B) =0 otherwise. 

If f{x,y) - /(r(x ^ y)) : {{S x T) U {A})^ ^ M 
is PD on (5 X T) U {A}, we directly establish that 
the local kernels KA(i)^s(i)(^7^)/ i^a^b(j){A,B) and 
K^(j)^A.(^i _B) are PD kernels on U x U. 

Let us define an alignment path between any
pair $(A, B)$ of sequences in $U$: the alignment of an
element $A(i_k)$ of $A$ with an element $B(j_k)$ of $B$ is
defined by a couple $p_k = (i_k, j_k)$, with $0 \le i_k \le |A|$
and $0 \le j_k \le |B|$. An alignment path for a pair
of sequences $(A, B)$ is defined by a sequence
$\pi = \pi(1), \pi(2), \cdots, \pi(k), \cdots, \pi(K)$ such that the
sequences $i_k$, $k = 1, \cdots, K$ and $j_k$, $k = 1, \cdots, K$ are
non-decreasing ($i_{k-1} \le i_k$, $j_{k-1} \le j_k$) and verify either
($i_{k-1} < i_k$) or ($j_{k-1} < j_k$) for all $k$.

Each alignment path $\pi = \pi(1), \pi(2), \cdots, \pi(K)$
uniquely characterizes a sequence of elementary editing
operations $\gamma = \gamma(1), \gamma(2), \cdots, \gamma(K)$, where each $\gamma(k)$
is either a match ($A(i_k) \to B(j_k)$), an insertion
($\Lambda \to B(j_k)$) or a deletion ($A(i_k) \to \Lambda$).

Finally, let $\mathcal{E}(A, B)$ denote the set of all existing
editing sequences between sequences $A$ and $B$. Note
that $\mathcal{E}(A, B)$ is finite iff $A$ and $B$ are finite sequences.
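
The correspondence between an alignment path and its editing operations can be sketched as follows. The function name `edit_operations` is hypothetical, and the sketch assumes (this is not stated in the text) that the path starts from the index pair (0, 0) and advances by unit steps, so that each step maps to exactly one elementary operation. Indices are 1-based and `None` stands for the null element $\Lambda$.

```python
def edit_operations(path, len_a, len_b):
    """Translate an alignment path into match / insertion / deletion steps.

    Assumption: the path starts from (0, 0) and moves by unit steps.
    Indices are 1-based; None stands for the null element Lambda.
    """
    ops = []
    i_prev, j_prev = 0, 0
    for i, j in path:
        if not (0 <= i <= len_a and 0 <= j <= len_b):
            raise ValueError(f"index out of range: {(i, j)}")
        step = (i - i_prev, j - j_prev)
        if step == (1, 1):
            ops.append(("match", i, j))          # A(i) -> B(j)
        elif step == (1, 0):
            ops.append(("deletion", i, None))    # A(i) -> Lambda
        elif step == (0, 1):
            ops.append(("insertion", None, j))   # Lambda -> B(j)
        else:
            # Covers stalled steps (0, 0), decreasing indices and jumps.
            raise ValueError(f"not a unit alignment step: {step}")
        i_prev, j_prev = i, j
    return ops


path = [(1, 0), (2, 1), (2, 2)]
print(edit_operations(path, len_a=2, len_b=2))
```

A step that advances neither index is rejected, which mirrors the requirement that either $i_{k-1} < i_k$ or $j_{k-1} < j_k$ holds for all $k$.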

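Proposition B.1: For any pair $(A, B) \in U^2$, any $(i, j)$
such that $0 \le i \le |A|$ and $0 \le j \le |B|$ we have

$$\langle A_1^i, B_1^j \rangle = \sum_{\gamma \in \mathcal{E}(A_1^i, B_1^j)} \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{\gamma(k)}(A_1^i, B_1^j) \qquad (9)$$

where $\prod^{*}$ is either the product operator if $*$ is the multiplication,
or the sum operator if $*$ is the addition.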
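
For the multiplicative case, this decomposition can be checked numerically on short sequences. The sketch below compares the three-term recursion that defines the STWK with the explicit sum over editing sequences. Everything specific in it is an assumption made for illustration: the function names, the Gaussian form of `local_kernel`, the choice of scalar samples, and the convention that the null element $\Lambda$ (written `None`) behaves like the value 0.0.

```python
import math
from functools import lru_cache


def local_kernel(a, b):
    """Hypothetical local kernel f(Gamma(a -> b)); None stands for Lambda."""
    x = 0.0 if a is None else a
    y = 0.0 if b is None else b
    return math.exp(-(x - y) ** 2)


def stwk_recursive(A, B, xi=1.0):
    """Multiplicative STWK computed with the three-term recursion."""
    @lru_cache(maxsize=None)
    def k(i, j):
        if i == 0 and j == 0:
            return xi
        total = 0.0
        if i > 0:
            total += k(i - 1, j) * local_kernel(A[i - 1], None)
        if i > 0 and j > 0:
            total += k(i - 1, j - 1) * local_kernel(A[i - 1], B[j - 1])
        if j > 0:
            total += k(i, j - 1) * local_kernel(None, B[j - 1])
        return total
    return k(len(A), len(B))


def editing_sequences(i, j):
    """Yield every editing sequence leading from (0, 0) to (i, j)."""
    if i == 0 and j == 0:
        yield ()
        return
    if i > 0:
        for seq in editing_sequences(i - 1, j):
            yield seq + ((i, None),)    # deletion A(i) -> Lambda
    if i > 0 and j > 0:
        for seq in editing_sequences(i - 1, j - 1):
            yield seq + ((i, j),)       # match A(i) -> B(j)
    if j > 0:
        for seq in editing_sequences(i, j - 1):
            yield seq + ((None, j),)    # insertion Lambda -> B(j)


def stwk_path_sum(A, B, xi=1.0):
    """Sum over editing sequences of xi times the product of local kernels."""
    total = 0.0
    for seq in editing_sequences(len(A), len(B)):
        term = xi
        for i, j in seq:
            a = None if i is None else A[i - 1]
            b = None if j is None else B[j - 1]
            term *= local_kernel(a, b)
        total += term
    return total


A, B = (0.0, 1.0, 0.5), (0.2, 0.9)
print(stwk_recursive(A, B), stwk_path_sum(A, B))
```

The two values agree up to floating-point rounding. The explicit enumeration grows very quickly with the sequence lengths, so it is only useful as a check; the recursion is the practical way to evaluate the kernel.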

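We prove proposition B.1 by induction on $r = i + j$.
The proposition is true for $r = 0$, since then $i = j = 0$
and then $\langle A_1^0, B_1^0 \rangle = \langle \Lambda, \Lambda \rangle = \xi$. The proposition is
true for $r = 1$, since

• $\langle A_1^0, B_1^1 \rangle = \langle A_1^0, B_1^0 \rangle * f(\Gamma(\Lambda \to B(1))) = \xi * \kappa_{\Lambda \to B(1)}(A_1^0, B_1^1)$, and

• $\langle A_1^1, B_1^0 \rangle = \langle A_1^0, B_1^0 \rangle * f(\Gamma(A(1) \to \Lambda)) = \xi * \kappa_{A(1) \to \Lambda}(A_1^1, B_1^0)$.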

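Let us suppose that proposition B.1 is true for all $n \le r$, with
$r \ge 1$, and let us show that it is true for $r + 1$.

First case: If $i > 0$ and $j > 0$, then by definition of
$\langle \cdot, \cdot \rangle$ we have

$$
\begin{aligned}
\langle A_1^i, B_1^j \rangle = {} & \langle A_1^{i-1}, B_1^j \rangle * f(\Gamma(A(i) \to \Lambda)) \\
& + \langle A_1^{i-1}, B_1^{j-1} \rangle * f(\Gamma(A(i) \to B(j))) \\
& + \langle A_1^i, B_1^{j-1} \rangle * f(\Gamma(\Lambda \to B(j))),
\end{aligned} \qquad (10)
$$

which rewrites

$$
\begin{aligned}
\langle A_1^i, B_1^j \rangle = {} & \langle A_1^{i-1}, B_1^j \rangle * \kappa_{A(i) \to \Lambda}(A_1^i, B_1^j) \\
& + \langle A_1^{i-1}, B_1^{j-1} \rangle * \kappa_{A(i) \to B(j)}(A_1^i, B_1^j) \\
& + \langle A_1^i, B_1^{j-1} \rangle * \kappa_{\Lambda \to B(j)}(A_1^i, B_1^j).
\end{aligned} \qquad (11)
$$

The three terms $\langle A_1^{i-1}, B_1^j \rangle$, $\langle A_1^{i-1}, B_1^{j-1} \rangle$ and
$\langle A_1^i, B_1^{j-1} \rangle$ in the right-hand side of the previous
equality enter into the inductive hypothesis and thus
decompose as follows:

$$
\begin{aligned}
\langle A_1^{i-1}, B_1^j \rangle &= \sum_{\gamma \in \mathcal{E}(A_1^{i-1}, B_1^j)} \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{\gamma(k)}(A_1^i, B_1^j) \\
\langle A_1^{i-1}, B_1^{j-1} \rangle &= \sum_{\gamma \in \mathcal{E}(A_1^{i-1}, B_1^{j-1})} \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{\gamma(k)}(A_1^i, B_1^j) \\
\langle A_1^i, B_1^{j-1} \rangle &= \sum_{\gamma \in \mathcal{E}(A_1^i, B_1^{j-1})} \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{\gamma(k)}(A_1^i, B_1^j)
\end{aligned} \qquad (12)
$$

Recombining equations (11) and (12) and completing the
editing sequences (each editing sequence of the three shorter
pairs is extended by its final deletion, match or insertion,
respectively) we get the expected decomposition
for $\langle A_1^i, B_1^j \rangle$.

Second case: If $i = r$ and $j = 0$, then by definition of
$\langle \cdot, \cdot \rangle$ we have necessarily

$$
\begin{aligned}
\langle A_1^i, B_1^j \rangle &= \xi * f(\Gamma(A(1) \to \Lambda)) * \cdots * f(\Gamma(A(r) \to \Lambda)) \\
&= \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{A(k) \to \Lambda}(A_1^i, B_1^j)
\end{aligned}
$$

which leads to the expected decomposition (a single
editing sequence exists in that case).

Third case: If $i = 0$ and $j = r$, the result is obtained
similarly to the second case.

Therefore proposition B.1 is true for $r + 1$. By
induction, proposition B.1 is true for all $r \in \mathbb{N}^+ \cup \{0\}$. □

Proof of theorem:

For all finite subset $V = \{A_1, A_2, \cdots, A_{|V|}\}$ of $U$ and all
$(\alpha_1, \alpha_2, \cdots, \alpha_{|V|}) \in \mathbb{R}^{|V|}$, according to proposition B.1 we
have

$$\sum_{m,n=1}^{|V|} \alpha_m \alpha_n \langle A_m, A_n \rangle = \sum_{m,n=1}^{|V|} \alpha_m \alpha_n \sum_{\gamma \in \mathcal{E}(A_m, A_n)} \xi * \prod^{*}_{k = 1 \cdots |\gamma|} \kappa_{\gamma(k)}(A_m, A_n).$$

Since $\xi \ge 0$, the decomposition of the STWK kernel
results in a sum of products (if $*$ is the multiplication,
with $\xi > 0$) or a sum of sums (if $*$ is the addition,
with $\xi = 0$) of local kernels applying on the same
arguments (proposition B.1) that are all positive definite
by assumption, and since positive definite kernels are
closed under summation or multiplication [3], the
STWK is proved to be positive definite.

As the sum of CPD or ND kernels is also CPD or
ND, similarly to i), proposition B.1 directly leads to a
proof of ii) and iii). □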

B.2 Proof of proposition 14.11 1 

The proofs of i) and ii) are obtained using a similar
recursion as the one used to prove theorem 4.4. iii) is
immediate.